Wine Quality Prediction

WINE!!!

Through this project we embark on an exhilarating journey through the world of wine, where we explore the delicate art of predicting wine quality. With each sip and swirl, we delve into the enchanting realm of data analysis and visualization, uncovering the hidden gems within the chemical attributes that define wine excellence.


LOADING THE DATASETS

wine=read.csv("winequalityN.csv")
xkabledplyhead(wine)
Head
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6
white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6
white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
xkabledplytail(wine)
Tail
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
6493 red 6.2 0.600 0.08 2.0 0.090 32 44 0.995 3.45 0.58 10.5 5
6494 red 5.9 0.550 0.10 2.2 0.062 39 51 0.995 3.52 NA 11.2 6
6495 red 6.3 0.510 0.13 2.3 0.076 29 40 0.996 3.42 0.75 11.0 6
6496 red 5.9 0.645 0.12 2.0 0.075 32 44 0.996 3.57 0.71 10.2 5
6497 red 6.0 0.310 0.47 3.6 0.067 18 42 0.996 3.39 0.66 11.0 6
#xkablesummary(wine)

We have succesfully loaded the dataset. We can see that there are 6497 observations and 13 variables.

Next, lets look at the summary statistics.


SUMMARY STATISTICS

xkablesummary(wine)
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:6497 Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.22 Min. : 8.0 Min. :3.00
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.250 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 17.0 1st Qu.: 77 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.43 1st Qu.: 9.5 1st Qu.:5.00
Median Mode :character Median : 7.00 Median :0.29 Median :0.310 Median : 3.0 Median :0.047 Median : 29.0 Median :118 Median :0.995 Median :3.21 Median :0.51 Median :10.3 Median :6.00
Mean NA Mean : 7.22 Mean :0.34 Mean :0.319 Mean : 5.4 Mean :0.056 Mean : 30.5 Mean :116 Mean :0.995 Mean :3.22 Mean :0.53 Mean :10.5 Mean :5.82
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.40 3rd Qu.:0.390 3rd Qu.: 8.1 3rd Qu.:0.065 3rd Qu.: 41.0 3rd Qu.:156 3rd Qu.:0.997 3rd Qu.:3.32 3rd Qu.:0.60 3rd Qu.:11.3 3rd Qu.:6.00
Max NA Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289.0 Max. :440 Max. :1.039 Max. :4.01 Max. :2.00 Max. :14.9 Max. :9.00
NA NA NA’s :10 NA’s :8 NA’s :3 NA’s :2 NA’s :2 NA NA NA NA’s :9 NA’s :4 NA NA

A quick look at the summary tells us the inter-quartile ranges and maximum and minimum values for each variable.

Observations:

  1. There are also some NA values which we need to remove.

  2. There might be duplicates in the dataset.


CLEANING THE DATA

Removing Duplicates

wine <- unique(wine)
xkablesummary(wine)
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:5329 Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.22 Min. : 8.0 Min. :3.0
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.240 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 16.0 1st Qu.: 75 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.43 1st Qu.: 9.5 1st Qu.:5.0
Median Mode :character Median : 7.00 Median :0.30 Median :0.310 Median : 2.7 Median :0.047 Median : 28.0 Median :116 Median :0.995 Median :3.21 Median :0.51 Median :10.4 Median :6.0
Mean NA Mean : 7.22 Mean :0.34 Mean :0.319 Mean : 5.1 Mean :0.057 Mean : 30.1 Mean :114 Mean :0.995 Mean :3.22 Mean :0.53 Mean :10.6 Mean :5.8
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.41 3rd Qu.:0.400 3rd Qu.: 7.5 3rd Qu.:0.066 3rd Qu.: 41.0 3rd Qu.:154 3rd Qu.:0.997 3rd Qu.:3.33 3rd Qu.:0.60 3rd Qu.:11.4 3rd Qu.:6.0
Max NA Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289.0 Max. :440 Max. :1.039 Max. :4.01 Max. :2.00 Max. :14.9 Max. :9.0
NA NA NA’s :10 NA’s :8 NA’s :3 NA’s :2 NA’s :2 NA NA NA NA’s :9 NA’s :4 NA NA

Duplicates can cause issues later with biases. Thus, we have removed the duplicate values.

Now, the dataset has 5329 observations.

Removing NA Values

wine <- na.omit(wine)
# str(wine)
xkablesummary(wine)
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:5295 Min. : 3.80 Min. :0.080 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.220 Min. : 8.0 Min. :3.0
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.230 1st Qu.:0.240 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 16 1st Qu.: 74 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.430 1st Qu.: 9.5 1st Qu.:5.0
Median Mode :character Median : 7.00 Median :0.300 Median :0.310 Median : 2.7 Median :0.047 Median : 28 Median :116 Median :0.995 Median :3.21 Median :0.510 Median :10.4 Median :6.0
Mean NA Mean : 7.22 Mean :0.344 Mean :0.319 Mean : 5.1 Mean :0.057 Mean : 30 Mean :114 Mean :0.995 Mean :3.22 Mean :0.533 Mean :10.6 Mean :5.8
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.410 3rd Qu.:0.400 3rd Qu.: 7.5 3rd Qu.:0.066 3rd Qu.: 41 3rd Qu.:154 3rd Qu.:0.997 3rd Qu.:3.33 3rd Qu.:0.600 3rd Qu.:11.4 3rd Qu.:6.0
Max NA Max. :15.90 Max. :1.580 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289 Max. :440 Max. :1.039 Max. :4.01 Max. :2.000 Max. :14.9 Max. :9.0

After, removing the NAs, we are finally left with 5295 observations. The dataset is now clean.

Now, the data is clean. We have removed the NAs, and duplicates. We can now start our exploratory data analysis.

Observations:

  1. There is a big difference between maximum value and Q3 values. This means that there are a quite a lot of outliers.

Let’s visualize this using plots.


UNIVARIATE PLOTS

To understand in depth as to which factor effects the quality the most.

Let’s look at the individual variables in the data set.

ggplot(data = wine, aes(x = quality)) +
  geom_bar(width = 0.8, color = 'black', fill = I('yellow')) +
  labs(
    title = "Overall Wine Quality",
    x = "Quality",
    y = "Data - Red & white wine"
  )

Observations:

  1. Wine quality shows a rather symmetrical distribution.

  2. Most wines have a quality score of 6.

  3. No wine achieved the highest score of 10 and the worst wines got a rating of 3.

Let’s see how the other factors are effecting!

p1 <- ggplot(data = wine, aes(x = citric.acid)) +
  geom_bar(fill = I('blue')) +
  labs(
    title = "Citric Acidity",
    x = "Concentration [g/L]",
    y = "Data"
  )

p2 <- ggplot(data = wine, aes(x = pH)) +
  geom_bar( fill = I('blue')) +
  labs(
    title = "pH",
    x = "pH",
    y = "Data"
  )
p3 <- ggplot(data = wine, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 1,  fill = I('blue')) +
  labs(
    title = "Residual Sugar",
    x = "Residual Sugar (g/L)",
    y = "Data"
  )
p4 <- ggplot(data = wine, aes(x = density)) +
  geom_histogram(binwidth = 0.002,  fill = I('blue')) +
  labs(
    title = "Density",
    x = "Density",
    y = "Data"
  )
p5 <- ggplot(data = wine, aes(x = chlorides)) +
  geom_histogram(binwidth = 0.005,  fill = I('blue')) +
  labs(
    title = "Chlorides",
    x = "Chloride Content (g/L)",
    y = "Data"
  )
p6 <-  ggplot(data = wine, aes(x = alcohol)) +
  geom_histogram(binwidth = 1,  fill = I('blue')) +
  labs(
    title = "Alcohol Content",
    x = "Alcohol Content (% by volume)",
    y = "Data"
  )
grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)

p7 <- ggplot(data = wine, aes(x = fixed.acidity)) +
  geom_bar( fill = I('blue')) +
  labs(
    title = "Fixed Acidity",
    x = "TaOH Concentration [g/L]",
    y = "Data"
  )

 p8 <- ggplot(data = wine, aes(x = volatile.acidity)) +
  geom_bar(  fill = I('blue')) +
  labs(
    title = "Volatile Acidity",
    x = "AcOH Concentration [g/L]",
    y = "Data"
  )
 
 p9 <- ggplot(wine, aes(x = free.sulfur.dioxide)) +
  geom_histogram(binwidth = 5,  fill = I('blue')) +
  labs(
    title = "Free Sulfur Dioxide Concentration",
    x = "Concentration (mg/L)",
    y = "Data"
  )

 p10 <- ggplot(wine, aes(x = total.sulfur.dioxide)) +
  geom_histogram(binwidth = 20, fill = "blue") +
  labs(
    title = "Total Sulfur Dioxide Concentration",
    x = "Concentration (mg/L)",
    y = "Data"
  )

grid.arrange(p7,p8, p9,p10, nrow = 2)

Observations:

  1. Most distributions encountered during the exploration of the parameters looked rather usual. In general, they were positively skewed with a narrow main peak.

We will also have a look at at the box plots for these!

p1 <- ggplot(data = wine, aes(x = "", y = fixed.acidity )) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Fixed Acidity",
    y = "TaOH Concentration [g/L]"
  )

p2 <- ggplot(data = wine, aes(x = "", y = volatile.acidity)) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Volatile Acidity",
    y = "AcOH Concentration [g/L]"
  )

 p3 <- ggplot(data = wine, aes(x = "", y = citric.acid)) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Citric Acidity",
    y = "Concentration [g/L]"
  )

 p4 <- ggplot(data = wine, aes(x = "", y = pH))+
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "pH",
    y = "pH"
  )
grid.arrange(p1,p2, p3,p4, nrow = 1)

Observations:

  1. Residual sugar has a very long-tail distribution with many outliers. It will be interesting to see how these outliers affect the quality of wine.

  2. Chlorides have distribution similar to residual sugar and have a strong concentration around the median. We also note a lot of outliers from the box plot.

  3. Most wines have less than 11%alcohol.

  4. Density has a very normal looking distribution with most of the values falling between 0.995 and 1.


SUMMARY OF UNIVARIATE PLOTS

  1. In general, the variables were positively skewed with a narrow main peak.

  2. Most wines have a pH of 3.2. Since we have chlorides,citric acid, and fixed and volatile acidity, the wines were bound to be on the acidic side.

  3. The wines have an alcohol content ranging between 8 and 15 vol%.

Now let us see that how different factors are related to the quality!


CORRELATION MATRIX

First, we will build correlation matrix to identify the variables which influence quality the most.

numeric_data <- subset(wine, select = -c(type))
redd <- subset(wine, type == "red")
numeric_datared <- subset(redd, select = -c(type))
whited <- subset(wine, type == "white")
numeric_datawhite <- subset(whited, select = -c(type))
loadPkg("corrplot")
cor_matrix <-cor(numeric_data)
corrplot(cor_matrix, method="circle",type="upper")

numeric_data <- subset(wine, select = -c(type))

Observations:

Our Target Variable is Quality, so we will focus on only those parameters which influence quality.

  1. Alcohol and quality have a high positive correlation.

  2. Density and quality have a negative correlation.

  3. Low correlation between quality and chloride concentration.

  4. Wine quality also slightly negatively correlates with volatile acidity.

  5. It has a slight positive correlation with citric acid.

From the correlation matrix, let us now see how these parameters vary over different quality ratings for red and white wine separately and find out how they are different.


SUBSETTING DATA INTO RED AND WHITE WINEs

# Subset the dataset into red and white wines
red <- wine[wine$type == "red", ]
white <- wine[wine$type == "white", ]

We will now subset the data into white and red wine separately to go ahead with our analysis.


BIVARIATE PLOTS

Through our correlation plot we have understood that Alcohol content, density, citric acid and chloride are the ones that are affecting the quality the most, let us see how and also make a comparitive analysis between red and white wine individually!

1. ALCOHOL VS QUALITY

library(ggplot2)
ggplot(wine, aes(x = as.factor(quality), y = alcohol)) +
  geom_boxplot(fill = "brown", color = "darkblue") +
  labs(x = "Wine Quality", y = "Alcohol") +
  ggtitle("Box Plot of Alcohol vs Quality")

The boxplot shows that wines with higher quality seem to have a higher alcohol content.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in alcohol content.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines.

t_test_alcohol <- t.test(red$alcohol,white$alcohol, level=0.05)

p <- t_test_alcohol$p.value

Observations:

  1. Since the p-value of 3.606^{-6} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol concentration between the two wines.

Let us check how it varies for red and white wine.

library(ggplot2)
red <- subset(wine, type == "red")
white <- subset(wine, type == "white")
ggplot() +
  
  geom_boxplot(data = red, aes(x = as.factor(quality), y = alcohol, fill = "red"), width = 0.4) +
  labs(x = "Wine Quality", y = "Alcohol", fill = "Wine Color") +
  ggtitle("Box Plot of Alcohol vs Quality for Red Wines") +
  scale_fill_manual(values = c("red" = "red"))

ggplot() +
  geom_boxplot(data = white, aes(x = as.factor(quality), y = alcohol, fill = "white"), width = 0.4) +
  labs(x = "Wine Quality", y = "Alcohol", fill = "Wine Color") +
  ggtitle("Box Plot of Alcohol vs Quality for White Wines") +
  scale_fill_manual(values = c("white" = "white"))

Observations:

  1. White Wines have higher alcohol content.

  2. Alcohol has a strong positive correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in red wine

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in red wine.

anova_result <- aov(alcohol ~  as.factor(quality), data = red, conf.level = 0.95)
print(summary(anova_result))
##                      Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(quality)    5    438    87.6     103 <2e-16 ***
## Residuals          1347   1144     0.8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)
summary(tukey_test)
##                    Length Class  Mode   
## as.factor(quality) 60     -none- numeric

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in alcohol level.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in white wine.

anova_result <- aov(alcohol ~  as.factor(quality), data = white)
print(summary(anova_result))
##                      Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(quality)    6   1467   244.5     220 <2e-16 ***
## Residuals          3935   4377     1.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)
tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = alcohol ~ as.factor(quality), data = white)
## 
## $`as.factor(quality)`
##       diff     lwr     upr p adj
## 4-3 -0.147 -0.8867  0.5934 0.997
## 5-3 -0.479 -1.1809  0.2223 0.405
## 6-3  0.304 -0.3957  1.0035 0.861
## 7-3  1.175  0.4692  1.8808 0.000
## 8-3  1.535  0.7882  2.2821 0.000
## 9-3  1.835  0.2794  3.3906 0.009
## 5-4 -0.333 -0.6009 -0.0644 0.005
## 6-4  0.451  0.1876  0.7134 0.000
## 7-4  1.322  1.0427  1.6006 0.000
## 8-4  1.682  1.3109  2.0527 0.000
## 9-4  1.982  0.5675  3.3957 0.001
## 6-5  0.783  0.6660  0.9003 0.000
## 7-5  1.654  1.5046  1.8040 0.000
## 8-5  2.014  1.7278  2.3011 0.000
## 9-5  2.314  0.9199  3.7087 0.000
## 7-6  0.871  0.7313  1.0111 0.000
## 8-6  1.231  0.9496  1.5130 0.000
## 9-6  1.531  0.1378  2.9245 0.020
## 8-7  0.360  0.0634  0.6568 0.006
## 9-7  0.660 -0.7365  2.0564 0.805
## 9-8  0.300 -1.1179  1.7176 0.996

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in alcohol level.


2. DENSITY VS QUALITY

library(ggplot2)
# Density Distribution by Wine Quality
ggplot(wine, aes(x = factor(quality), y = density)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Quality", y = "Density") +
  ggtitle("Density Distribution by Wine Quality")+
  ylim(1, 1.001)  # Adjust the limits as needed

The boxplot shows that wines with higher quality seem to have a less denser.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in density level.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in density level.

t_test_density <- t.test(red$density,white$density, level=0.05)

p <- t_test_density$p.value
print(p)

[1] 6.92e-322

Observations:

  1. Since the p-value of 7^{-322} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean densitylevel between the two wines.

Let us check how it varies for red and white wine

# Box Plot of Density vs Quality for Red Wines
ggplot() +
  geom_boxplot(data = red, aes(x = as.factor(quality), y = density, fill = "light coral"), width = 0.4) +
  labs(x = "Wine Quality", y = "Density", fill = "light coral") +
  ggtitle("Box Plot of Density vs Quality for Red Wines") +
  scale_fill_manual(values = c("light coral" = "light coral"))

# Box Plot of Density vs Quality for White Wines
ggplot() +
  geom_boxplot(data = white, aes(x = as.factor(quality), y = density, fill = "red"), width = 0.4) +
  labs(x = "Wine Quality", y = "Density", fill = "red") +
  ggtitle("Box Plot of Density vs Quality for White Wines") +
  scale_fill_manual(values = c("red" = "red"))

Observations:

  1. Red Wines are more dense than white wines.

  2. Density has a negative correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in red wine.

anova_result <- aov(density ~  as.factor(quality), data = red, conf.level = 0.95)
print(summary(anova_result))
##                      Df  Sum Sq  Mean Sq F value Pr(>F)    
## as.factor(quality)    5 0.00021 4.27e-05    12.8  4e-12 ***
## Residuals          1347 0.00451 3.40e-06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)
tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = density ~ as.factor(quality), data = red, conf.level = 0.95)
## 
## $`as.factor(quality)`
##          diff       lwr       upr p adj
## 4-3 -0.000926 -0.002730  8.77e-04 0.686
## 5-3 -0.000380 -0.002047  1.29e-03 0.987
## 6-3 -0.000886 -0.002553  7.82e-04 0.654
## 7-3 -0.001413 -0.003114  2.88e-04 0.167
## 8-3 -0.002369 -0.004451 -2.87e-04 0.015
## 5-4  0.000546 -0.000210  1.30e-03 0.309
## 6-4  0.000041 -0.000718  8.00e-04 1.000
## 7-4 -0.000486 -0.001316  3.43e-04 0.550
## 8-4 -0.001442 -0.002902  1.73e-05 0.055
## 6-5 -0.000505 -0.000819 -1.91e-04 0.000
## 7-5 -0.001033 -0.001492 -5.73e-04 0.000
## 8-5 -0.001988 -0.003274 -7.03e-04 0.000
## 7-6 -0.000527 -0.000991 -6.39e-05 0.015
## 8-6 -0.001483 -0.002770 -1.96e-04 0.013
## 8-7 -0.000956 -0.002286  3.74e-04 0.314

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean density across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in density level.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in white wine.

anova_result <- aov(density ~  as.factor(quality), data = white)
print(summary(anova_result))
##                      Df  Sum Sq  Mean Sq F value Pr(>F)    
## as.factor(quality)    6 0.00458 0.000764     105 <2e-16 ***
## Residuals          3935 0.02872 0.000007                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)

tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = density ~ as.factor(quality), data = white)
## 
## $`as.factor(quality)`
##          diff      lwr       upr p adj
## 4-3 -0.000695 -0.00259  1.20e-03 0.934
## 5-3  0.000182 -0.00161  1.98e-03 1.000
## 6-3 -0.001160 -0.00295  6.33e-04 0.475
## 7-3 -0.002824 -0.00463 -1.02e-03 0.000
## 8-3 -0.003139 -0.00505 -1.23e-03 0.000
## 9-3 -0.003424 -0.00741  5.61e-04 0.147
## 5-4  0.000877  0.00019  1.56e-03 0.003
## 6-4 -0.000465 -0.00114  2.09e-04 0.392
## 7-4 -0.002129 -0.00284 -1.41e-03 0.000
## 8-4 -0.002444 -0.00339 -1.49e-03 0.000
## 9-4 -0.002729 -0.00635  8.93e-04 0.283
## 6-5 -0.001342 -0.00164 -1.04e-03 0.000
## 7-5 -0.003006 -0.00339 -2.62e-03 0.000
## 8-5 -0.003321 -0.00406 -2.59e-03 0.000
## 9-5 -0.003606 -0.00718 -3.44e-05 0.046
## 7-6 -0.001664 -0.00202 -1.31e-03 0.000
## 8-6 -0.001979 -0.00270 -1.26e-03 0.000
## 9-6 -0.002264 -0.00583  1.30e-03 0.500
## 8-7 -0.000315 -0.00108  4.45e-04 0.885
## 9-7 -0.000600 -0.00418  2.98e-03 0.999
## 9-8 -0.000285 -0.00392  3.35e-03 1.000

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean density level across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in density level.


3. CHLORIDES VS QUALITY

ggplot(wine, aes(x = factor(quality), y = chlorides)) +
  geom_boxplot(fill = "lightcoral") +
  labs(x = "Quality", y = "Chlorides") +
  ggtitle("Chloride Distribution by Wine Quality") +
  ylim(0, 0.2)

The boxplot shows that wines with higher quality seem to have a less chlorides.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in chloride concentration.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in chloride concentration.

t_test_chloride <- t.test(red$chlorides,white$chlorides, level=0.05)

p <- t_test_chloride$p.value
print(p)

[1] 3.71e-159

Observations:s

  1. Since the p-value of 3.706^{-159} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration between the two wines.

Let us check how it varies for red and white wine

ggplot(red, aes(x = factor(quality), y = chlorides)) +
  geom_boxplot(fill = "yellow") +
  labs(x = "Quality", y = "Chlorides") +
  ggtitle("Chloride Distribution by Wine Quality (Red Wine)") +
  ylim(0, 0.3)

ggplot(white, aes(x = factor(quality), y = chlorides)) +
  geom_boxplot(fill = "coral") +
  labs(x = "Quality", y = "Chlorides") +
  ggtitle("Chloride Distribution by White Wine Quality") +
  ylim(0, 0.2)

Observations:

  1. Red Wines have more chloride concentration than white wines.

  2. Chloride Concentration has a slight negative correlation with quality

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride contentration across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride contentration across quality categories in red wine.

anova_result <- aov(chlorides ~  as.factor(quality), data = red, conf.level = 0.95)
print(summary(anova_result))
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## as.factor(quality)    5   0.06 0.01286    5.34 7.2e-05 ***
## Residuals          1347   3.24 0.00241                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)
tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = chlorides ~ as.factor(quality), data = red, conf.level = 0.95)
## 
## $`as.factor(quality)`
##         diff     lwr       upr p adj
## 4-3 -0.03244 -0.0808  1.59e-02 0.393
## 5-3 -0.02851 -0.0732  1.62e-02 0.452
## 6-3 -0.03730 -0.0820  7.40e-03 0.164
## 7-3 -0.04567 -0.0913 -7.67e-05 0.049
## 8-3 -0.05415 -0.1100  1.66e-03 0.063
## 5-4  0.00394 -0.0163  2.42e-02 0.994
## 6-4 -0.00485 -0.0252  1.55e-02 0.984
## 7-4 -0.01323 -0.0355  9.01e-03 0.534
## 8-4 -0.02170 -0.0608  1.74e-02 0.610
## 6-5 -0.00879 -0.0172 -3.65e-04 0.035
## 7-5 -0.01716 -0.0295 -4.85e-03 0.001
## 8-5 -0.02564 -0.0601  8.82e-03 0.276
## 7-6 -0.00837 -0.0208  4.05e-03 0.389
## 8-6 -0.01685 -0.0514  1.77e-02 0.731
## 8-7 -0.00848 -0.0441  2.72e-02 0.984

Observations

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant differnece in chloride concentration.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride contentration across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride contentration across quality categories in white wine.

anova_result <- aov(chlorides ~  as.factor(quality), data = white)
print(summary(anova_result))
##                      Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(quality)    6  0.114  0.0190    37.7 <2e-16 ***
## Residuals          3935  1.987  0.0005                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)

tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = chlorides ~ as.factor(quality), data = white)
## 
## $`as.factor(quality)`
##          diff      lwr       upr p adj
## 4-3 -0.004162 -0.01993  0.011603 0.987
## 5-3 -0.001961 -0.01691  0.012985 1.000
## 6-3 -0.009171 -0.02407  0.005733 0.538
## 7-3 -0.016760 -0.03180 -0.001725 0.018
## 8-3 -0.017544 -0.03346 -0.001633 0.020
## 9-3 -0.026900 -0.06004  0.006239 0.201
## 5-4  0.002201 -0.00351  0.007915 0.917
## 6-4 -0.005009 -0.01061  0.000592 0.115
## 7-4 -0.012598 -0.01854 -0.006655 0.000
## 8-4 -0.013382 -0.02128 -0.005481 0.000
## 9-4 -0.022738 -0.05286  0.007386 0.281
## 6-5 -0.007210 -0.00970 -0.004714 0.000
## 7-5 -0.014799 -0.01799 -0.011609 0.000
## 8-5 -0.015583 -0.02169 -0.009476 0.000
## 9-5 -0.024939 -0.05464  0.004765 0.168
## 7-6 -0.007589 -0.01057 -0.004609 0.000
## 8-6 -0.008373 -0.01437 -0.002373 0.001
## 9-6 -0.017729 -0.04741  0.011953 0.574
## 8-7 -0.000784 -0.00710  0.005536 1.000
## 9-7 -0.010140 -0.03989  0.019609 0.953
## 9-8 -0.009356 -0.03956  0.020846 0.970

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in chloride concentration.


4. CITRIC ACID VS QUALITY

library(ggplot2)
ggplot(wine, aes(x = factor(quality), y = citric.acid)) +
  geom_boxplot(fill = "lightpink") +
  labs(x = "Quality", y = "Citric Acid") +
  ggtitle("Citric Acid Distribution by Wine Quality")+
 ylim(0.0,0.15)

The boxplot shows that wines with higher quality seem to have a high citric acid.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in citric acid concentration.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in citric acid concentration.

t_test_citric <- t.test(red$citric.acid,white$citric.acid, level=0.05)

p <- t_test_citric$p.value
print(p)

[1] 1.33e-26

Observations:

  1. Since the p-value of 1.335^{-26} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid concentration between the two wines.

Let us check how it varies for red and white wine

ggplot(red, aes(x = factor(quality), y = citric.acid)) +
  geom_boxplot(fill = "green") +
  labs(x = "Quality", y = "Citric Acid") +
  ggtitle("Citric Acid Distribution by Red Wine Quality")+
  ylim(0.0,0.15)

  ggplot(white, aes(x = factor(quality), y = citric.acid)) +
  geom_boxplot(fill = "maroon") +
  labs(x = "Quality", y = "Citric Acid") +
  ggtitle("Citric Acid Distribution by White Wine Quality")+
  ylim(0.0,0.15)

Observations:

  1. White Wines have more citric acid concentration than red wines.

  2. Citric ACid Concentration has a slight positive correlation with quality.

  3. There isn’t much difference in citric acid concentration in white wines across the quality ratings.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in red wine.

anova_result <- aov(citric.acid ~  as.factor(quality), data = red, conf.level = 0.95)
print(summary(anova_result))
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## as.factor(quality)    5    2.9   0.584    16.1 1.8e-15 ***
## Residuals          1347   48.8   0.036                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)
tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = citric.acid ~ as.factor(quality), data = red, conf.level = 0.95)
## 
## $`as.factor(quality)`
##        diff      lwr    upr p adj
## 4-3 0.00477 -0.18279 0.1923 1.000
## 5-3 0.07377 -0.09948 0.2470 0.830
## 6-3 0.10949 -0.06389 0.2829 0.465
## 7-3 0.20086  0.02402 0.3777 0.015
## 8-3 0.21194 -0.00453 0.4284 0.059
## 5-4 0.06901 -0.00965 0.1477 0.124
## 6-4 0.10472  0.02579 0.1836 0.002
## 7-4 0.19609  0.10983 0.2823 0.000
## 8-4 0.20717  0.05542 0.3589 0.001
## 6-5 0.03572  0.00304 0.0684 0.023
## 7-5 0.12708  0.07934 0.1748 0.000
## 8-5 0.13817  0.00450 0.2718 0.038
## 7-6 0.09137  0.04318 0.1396 0.000
## 8-6 0.10245 -0.03138 0.2363 0.246
## 8-7 0.01108 -0.12720 0.1494 1.000

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid across all categories of red wine.

  2. We also have done Tukey Test to check in which quality ratings, there is a significant difference in citric acid level.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in white wine.

anova_result <- aov(citric.acid ~  as.factor(quality), data = white)
print(summary(anova_result))
##                      Df Sum Sq Mean Sq F value Pr(>F)  
## as.factor(quality)    6    0.2  0.0363    2.43  0.024 *
## Residuals          3935   58.8  0.0150                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_test <- TukeyHSD(anova_result)

tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = citric.acid ~ as.factor(quality), data = white)
## 
## $`as.factor(quality)`
##          diff       lwr     upr p adj
## 4-3 -0.031789 -0.117583 0.05400 0.930
## 5-3  0.000279 -0.081059 0.08162 1.000
## 6-3  0.002416 -0.078687 0.08352 1.000
## 7-3 -0.008569 -0.090389 0.07325 1.000
## 8-3 -0.000962 -0.087551 0.08563 1.000
## 9-3  0.050000 -0.130341 0.23034 0.983
## 5-4  0.032068  0.000969 0.06317 0.038
## 6-4  0.034205  0.003727 0.06468 0.016
## 7-4  0.023220 -0.009118 0.05556 0.342
## 8-4  0.030828 -0.012172 0.07383 0.344
## 9-4  0.081789 -0.082144 0.24572 0.762
## 6-5  0.002137 -0.011441 0.01572 0.999
## 7-5 -0.008848 -0.026203 0.00851 0.743
## 8-5 -0.001241 -0.034472 0.03199 1.000
## 9-5  0.049721 -0.111925 0.21137 0.972
## 7-6 -0.010985 -0.027202 0.00523 0.416
## 8-6 -0.003378 -0.036030 0.02927 1.000
## 9-6  0.047584 -0.113944 0.20911 0.977
## 8-7  0.007608 -0.026787 0.04200 0.995
## 9-7  0.058569 -0.103320 0.22046 0.938
## 9-8  0.050962 -0.113390 0.21531 0.970

Observations:

  1. From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid level across all categories of red wine.

SUMMARY OF BIVARIATE PLOTS

From the bivariate plots, we have concluded that :

  1. For good quality wines, the alcohol content is more.

  2. As the density decreases, the quality gets better.

  3. As the chloride concentration decreases, quality gets better.

  4. For better wines, the citric acid concentration is more.

  5. Whites wines generally have less density and have more alcohol conent.


MULTIVARIATE PLOTS

For the last part of our EDA, we will perform some multivariate plots to see some how the other non-important features in wine are distributed in red and white wine.

library(ggplot2)
ggplot(wine, aes(x = residual.sugar, y = density, color = factor(type))) +
  geom_point() +ggtitle("Scatter plot for Density vs Sugar ") +
  labs(x = "Sugar", y = "Density", color = "Wine Type")

ggplot(wine, aes(x = alcohol, y = density, color = factor(type))) +
  geom_point() +  ggtitle("Scatter plot for Density vs Alcohol ") +
  labs(x = "Alcohol", y = "Density", color = "Wine Type")

 ggplot(wine, aes(x = alcohol, y = chlorides, color = factor(type))) +
  geom_point() + ggtitle("Scatter plot for Chlorides vs Alcohol ") +
  labs(x = "Alcohol", y = "Chlorides", color = "Wine Type")

ggplot(wine, aes(x = sulphates, y = residual.sugar, color = factor(type))) +
  geom_point() +ggtitle("Scatter plot for Sulphates vs Sugar ") +
  labs(x = "Sulphates", y = "Residual Sugar", color = "Wine Type")

Observations:

  1. White Wines have more sugar concentration than red wines. This might explain why white wines are usually more sweet.

  2. Red Wines have more sulphate concentration.


CONCLUSIONS FROM EDA

In our preliminary analysis, we have uncovered some key insights:

  1. Wines with elevated alcohol content, increased citric acid levels, lower density, and reduced chlorides tend to exhibit higher quality.

  2. It appears that white wines, in general, tend to be sweeter and have higher alcohol content when compared to their red counterparts.


MODELLING THE DATA

From our EDA, we have identified some key features which influence the target variable- Quality.

As part of our next analysis, we will now perform some regression models to see which model predicts the quality of wine with most accuracy.


FEATURE SELECTION

loadPkg("leaps")
reg.best2 <- regsubsets(quality~., data = numeric_data, nbest = 2, method = "exhaustive") 
plot(reg.best2, scale = "adjr2", main = "Adjusted R^2")

plot(reg.best2, scale = "r2", main = "R^2")

plot(reg.best2, scale = "bic", main = "BIC")

plot(reg.best2, scale = "Cp", main = "Cp")

We can see from the results of exhaustive forward method of feature selection, the best $ adjusted R^2$ score and lowest BIC score are for these 7 features.

  1. volatile.acidity

  2. free.sulfur.dioxide

  3. total.sulfur.dioxide

  4. pH

  5. sulphates

  6. alcohol

  7. residual.sugar

Going forward, we will use these 7 variables for our analysis.

Our target variable is Quality.

1. LINEAR REGRESSION

model1 = lm(quality ~ volatile.acidity+residual.sugar+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol, data=numeric_data)
sum_md1 = summary(model1) # also for later use on inline codes
sum_md1
## 
## Call:
## lm(formula = quality ~ volatile.acidity + residual.sugar + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = numeric_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.992 -0.450 -0.024  0.462  3.144 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.391874   0.240127    5.80  7.2e-09 ***
## volatile.acidity     -1.390559   0.068586  -20.27  < 2e-16 ***
## residual.sugar        0.017460   0.002675    6.53  7.3e-11 ***
## free.sulfur.dioxide   0.006692   0.000825    8.11  6.4e-16 ***
## total.sulfur.dioxide -0.002224   0.000285   -7.81  6.7e-15 ***
## pH                    0.273930   0.066893    4.10  4.3e-05 ***
## sulphates             0.620295   0.071304    8.70  < 2e-16 ***
## alcohol               0.344362   0.009197   37.44  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.734 on 5287 degrees of freedom
## Multiple R-squared:  0.305,  Adjusted R-squared:  0.304 
## F-statistic:  331 on 7 and 5287 DF,  p-value: <2e-16
predicted_values <- predict(model1, newdata = numeric_data)

# Calculate residuals (the differences between predicted and actual values)
residuals <- predicted_values - numeric_data$quality

# Calculate Mean Squared Error (MSE)
mse <- mean(residuals^2)
print(sqrt(0.53))
## [1] 0.728
cat("MSE",mse,"\n")
## MSE 0.538
  1. The p-value for all the independent variables are less than 0.05. Thus, all are significant.

  2. The model explains 30% of variance.

  3. MSE is around 0.53


2. LOGISTIC REGRESSION

For Logistic Regression, we want the outcome variable to be categorical. Hence,w e are converting all the wines from 0 to 6 to “Bad Wines” and from 6-10 as “Good Wines”.

wine <- wine %>%
  mutate(quality_category = ifelse(quality < 6, 0, 1))
head(wine)
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality quality_category
1 white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6 1
2 white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6 1
3 white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6 1
4 white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6 1
7 white 6.2 0.32 0.16 7.0 0.045 30 136 0.995 3.18 0.47 9.6 6 1
10 white 8.1 0.22 0.43 1.5 0.044 28 129 0.994 3.22 0.45 11.0 6 1
winelogit1 <- glm(quality_category ~volatile.acidity+residual.sugar+pH+sulphates+alcohol+free.sulfur.dioxide+total.sulfur.dioxide, data = wine, binomial(link = "logit"))
summary(winelogit1)
## 
## Call:
## glm(formula = quality_category ~ volatile.acidity + residual.sugar + 
##     pH + sulphates + alcohol + free.sulfur.dioxide + total.sulfur.dioxide, 
##     family = binomial(link = "logit"), data = wine)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.17e+01   8.29e-01  -14.07  < 2e-16 ***
## volatile.acidity     -4.28e+00   2.45e-01  -17.45  < 2e-16 ***
## residual.sugar        5.22e-02   8.64e-03    6.04  1.5e-09 ***
## pH                    8.50e-01   2.24e-01    3.80  0.00015 ***
## sulphates             1.89e+00   2.48e-01    7.64  2.1e-14 ***
## alcohol               9.57e-01   3.71e-02   25.78  < 2e-16 ***
## free.sulfur.dioxide   1.84e-02   2.80e-03    6.59  4.5e-11 ***
## total.sulfur.dioxide -7.04e-03   9.27e-04   -7.59  3.1e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6999.2  on 5294  degrees of freedom
## Residual deviance: 5444.6  on 5287  degrees of freedom
## AIC: 5461
## 
## Number of Fisher Scoring iterations: 4
exp(coef(winelogit1))
##          (Intercept)     volatile.acidity       residual.sugar 
##             8.53e-06             1.39e-02             1.05e+00 
##                   pH            sulphates              alcohol 
##             2.34e+00             6.65e+00             2.60e+00 
##  free.sulfur.dioxide total.sulfur.dioxide 
##             1.02e+00             9.93e-01

From the logistic regression model we can see that :

  1. All variables are statistically significant as the p-value is less than 0.05.

  2. the log odds and odds ratio both tell us how the variables will change with every 1 unit change.

CONFUSION MATRIX

loadPkg(regclass)
conf_matrix = confusion_matrix(winelogit1)

unloadPkg(regclass)
xkabledply(conf_matrix,"Confusion Matrix: Logit model, cutoff = 0.6")
Confusion Matrix: Logit model, cutoff = 0.6
Predicted 0 Predicted 1 Total
Actual 0 1185 794 1979
Actual 1 530 2786 3316
Total 1715 3580 5295
true_positives <- 2786
true_negatives <- 1185
false_positives <-794
false_negatives <-530
total <- 5295

# Calculate accuracy
accuracy <- (true_positives + true_negatives) / total
cat("Accuracy of logistic regression model is", accuracy,"\n")

Accuracy of logistic regression model is 0.75

precision <- true_positives / (true_positives + false_positives)
cat(" Precision of logistic regression model is",  precision, "\n")

Precision of logistic regression model is 0.778

The accuracy and precision from the confusion matrix tells us that the model is a good fit.

We will also check it from the ROC-AUC curve as well.

ROC-AUC CURVE

#install.packages("pROC")

loadPkg(pROC) # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
wine$prob=predict(winelogit1, type = c("response")) # Add new column of predicted probabilities
h <- roc(quality_category~prob, data=wine)
auc(h) # area-under-curve prefer 0.8 or higher.

Area under the curve: 0.808

plot(h)

#unloadPkg(pROC) 

The ROC-AUC curve value is 0.808. From our domain knowledge, we know that the model fits well when the value is between 0.8 to 1. Thus, our model is a good fit.

To test this hypothesis, we will also conduct a test.

HOSMER & LEMESHOW TEST

#### Hosmer and Lemeshow test  
########The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit. 

loadPkg(ResourceSelection) # function hoslem.test( ) for logit model evaluation
winelogit1_Hoslem = hoslem.test(wine$quality_category, fitted(winelogit1)) # Hosmer and Lemeshow test, a chi-squared test
unloadPkg(ResourceSelection) 
winelogit1_Hoslem
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  wine$quality_category, fitted(winelogit1)
## X-squared = 11, df = 8, p-value = 0.2

The p-value is greater than 0.05. So, we fail to reject the null hypothesis and conclude that the model is a good fit.

McFaDDEN TEST

# We can try this just to confirm the following calculation
wine_nulllogit1 <- glm(quality_category ~ 1, data = wine, family = "binomial")
mcFadden = 1 - logLik(winelogit1)/logLik(wine_nulllogit1)
mcFadden
## 'log Lik.' 0.222 (df=8)
loadPkg(pscl) # use pR2( ) function to calculate McFadden statistics for model eval
wineLogit1pr2 = pR2(winelogit1)
## fitting null model for pseudo-r2
wineLogit1pr2
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -2722.313 -3499.575  1554.526     0.222     0.254     0.347
unloadPkg(pscl) 

3. KNN

loadPkg("class")
library(class)
set.seed(123)

sample_indices <- sample(1:nrow(wine), 0.7 * nrow(wine))

train_data <- wine[sample_indices, ]
test_data <- wine[-sample_indices, ]

library(class)

# Define predictors and response variable
predictors <- names(wine)[2:13]  # Adjust column indices based on your data
response <- "quality"
train_data$quality_category <- as.factor(train_data$quality_category)


k <- 5  # You can adjust the value of k
knn_model <- knn(train = train_data[, predictors], test = test_data[, predictors], cl = train_data$quality_category, k = k)


confusion_matrix <- table(Actual = test_data$quality_category, Predicted = knn_model)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <-confusion_matrix[2,2]/sum(diag(confusion_matrix))
confusion_matrix
##       Predicted
## Actual   0   1
##      0 329 269
##      1 183 808
# Display the confusion matrix and accuracy
print(confusion_matrix)
##       Predicted
## Actual   0   1
##      0 329 269
##      1 183 808
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.716
cat("Precision :", precision, "\n")
## Precision : 0.711
str(knn_model)
##  Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 2 ...
length(knn_model)
## [1] 1589
table(knn_model)
## knn_model
##    0    1 
##  512 1077
library(caret)
set.seed(123)
cv <- trainControl(method = "cv", number = 10)
model <- train(quality_category ~ ., data = train_data, method = "knn", trControl = cv)
model
## k-Nearest Neighbors 
## 
## 3706 samples
##   14 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3334, 3335, 3335, 3336, 3336, 3335, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy  Kappa
##   5  0.712     0.361
##   7  0.711     0.357
##   9  0.706     0.340
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

We first conducted a cross validation test to see which k-value gives the bestaccuracy. We found out that for k=5, we get the best accuracy.


4. DECISION TREES


library(rpart)
set.seed(123) 
trainIndex <- sample(nrow(wine), 0.9 * nrow(wine))
trainData <- wine[trainIndex, ]
testData <- wine[-trainIndex, ]

tree_model <- rpart(quality ~ fixed.acidity + volatile.acidity + residual.sugar + free.sulfur.dioxide + density + pH + sulphates + alcohol, data = trainData)

print(summary(tree_model))
## Call:
## rpart(formula = quality ~ fixed.acidity + volatile.acidity + 
##     residual.sugar + free.sulfur.dioxide + density + pH + sulphates + 
##     alcohol, data = trainData)
##   n= 4765 
## 
##       CP nsplit rel error xerror   xstd
## 1 0.1665      0     1.000  1.000 0.0220
## 2 0.0338      1     0.834  0.844 0.0207
## 3 0.0321      2     0.800  0.811 0.0206
## 4 0.0136      3     0.768  0.789 0.0199
## 5 0.0100      4     0.754  0.783 0.0197
## 
## Variable importance
##             alcohol             density    volatile.acidity           sulphates 
##                  49                  25                  12                   5 
##       fixed.acidity      residual.sugar free.sulfur.dioxide                  pH 
##                   4                   2                   2                   1 
## 
## Node number 1: 4765 observations,    complexity param=0.166
##   mean=5.8, MSE=0.781 
##   left son=2 (2767 obs) right son=3 (1998 obs)
##   Primary splits:
##       alcohol             < 10.6  to the left,  improve=0.1660, (0 missing)
##       density             < 0.992 to the right, improve=0.1030, (0 missing)
##       volatile.acidity    < 0.458 to the right, improve=0.0519, (0 missing)
##       free.sulfur.dioxide < 18.5  to the left,  improve=0.0223, (0 missing)
##       sulphates           < 0.685 to the left,  improve=0.0132, (0 missing)
##   Surrogate splits:
##       density        < 0.993 to the right, agree=0.790, adj=0.500, (0 split)
##       sulphates      < 0.405 to the right, agree=0.607, adj=0.062, (0 split)
##       fixed.acidity  < 6.05  to the right, agree=0.603, adj=0.054, (0 split)
##       residual.sugar < 5.28  to the right, agree=0.586, adj=0.014, (0 split)
##       pH             < 3.54  to the left,  agree=0.583, adj=0.007, (0 split)
## 
## Node number 2: 2767 observations,    complexity param=0.0338
##   mean=5.49, MSE=0.553 
##   left son=4 (1498 obs) right son=5 (1269 obs)
##   Primary splits:
##       volatile.acidity    < 0.282 to the right, improve=0.08210, (0 missing)
##       alcohol             < 10.1  to the left,  improve=0.03980, (0 missing)
##       free.sulfur.dioxide < 24.5  to the left,  improve=0.02160, (0 missing)
##       density             < 0.995 to the right, improve=0.00820, (0 missing)
##       sulphates           < 0.675 to the left,  improve=0.00746, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < 29.5  to the left,  agree=0.654, adj=0.246, (0 split)
##       density             < 0.995 to the right, agree=0.634, adj=0.203, (0 split)
##       sulphates           < 0.505 to the right, agree=0.628, adj=0.190, (0 split)
##       residual.sugar      < 4.15  to the left,  agree=0.625, adj=0.182, (0 split)
##       fixed.acidity       < 6.85  to the right, agree=0.598, adj=0.124, (0 split)
## 
## Node number 3: 1998 observations,    complexity param=0.0321
##   mean=6.22, MSE=0.786 
##   left son=6 (1100 obs) right son=7 (898 obs)
##   Primary splits:
##       alcohol             < 11.7  to the left,  improve=0.0760, (0 missing)
##       free.sulfur.dioxide < 11.5  to the left,  improve=0.0442, (0 missing)
##       volatile.acidity    < 0.587 to the right, improve=0.0405, (0 missing)
##       density             < 0.991 to the right, improve=0.0311, (0 missing)
##       fixed.acidity       < 7.05  to the right, improve=0.0117, (0 missing)
##   Surrogate splits:
##       density          < 0.991 to the right, agree=0.712, adj=0.359, (0 split)
##       volatile.acidity < 0.282 to the left,  agree=0.582, adj=0.069, (0 split)
##       sulphates        < 0.365 to the right, agree=0.573, adj=0.050, (0 split)
##       fixed.acidity    < 5.95  to the right, agree=0.571, adj=0.046, (0 split)
##       residual.sugar   < 13.6  to the left,  agree=0.553, adj=0.006, (0 split)
## 
## Node number 4: 1498 observations
##   mean=5.3, MSE=0.443 
## 
## Node number 5: 1269 observations
##   mean=5.72, MSE=0.583 
## 
## Node number 6: 1100 observations,    complexity param=0.0136
##   mean=6, MSE=0.768 
##   left son=12 (174 obs) right son=13 (926 obs)
##   Primary splits:
##       volatile.acidity    < 0.455 to the right, improve=0.0600, (0 missing)
##       free.sulfur.dioxide < 11.5  to the left,  improve=0.0533, (0 missing)
##       pH                  < 3.5   to the right, improve=0.0157, (0 missing)
##       sulphates           < 0.675 to the left,  improve=0.0154, (0 missing)
##       fixed.acidity       < 7.05  to the right, improve=0.0129, (0 missing)
##   Surrogate splits:
##       pH            < 3.5   to the right, agree=0.853, adj=0.069, (0 split)
##       density       < 1     to the right, agree=0.847, adj=0.034, (0 split)
##       fixed.acidity < 14.9  to the right, agree=0.844, adj=0.011, (0 split)
## 
## Node number 7: 898 observations
##   mean=6.49, MSE=0.675 
## 
## Node number 12: 174 observations
##   mean=5.51, MSE=0.79 
## 
## Node number 13: 926 observations
##   mean=6.09, MSE=0.709 
## 
## n= 4765 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4765 3720 5.80  
##    2) alcohol< 10.6 2767 1530 5.49  
##      4) volatile.acidity>=0.282 1498  664 5.30 *
##      5) volatile.acidity< 0.282 1269  740 5.72 *
##    3) alcohol>=10.6 1998 1570 6.22  
##      6) alcohol< 11.7 1100  845 6.00  
##       12) volatile.acidity>=0.455 174  137 5.51 *
##       13) volatile.acidity< 0.455 926  657 6.09 *
##      7) alcohol>=11.7 898  606 6.49 *
library(rpart.plot)
rpart.plot(tree_model, digits = 4, extra = 1)

  1. The root node splits the full dataset based on the alcohol content of the wines, with a split point of 10.6. Wines with alcohol < 10.6 go to the left branch and wines with alcohol >= 10.6 go to the right branch.

  2. The left branch then further splits on volatile acidity being < or >= 0.282. This results in two leaf nodes predicting quality scores of 5.30 and 5.72 for the low and high volatility acidity wines.

  3. The right branch splits again on alcohol content < or >= 11.7. The left sub-branch splits once more on volatile acidity to generate two leaf nodes predicting quality of 5.51 and 6.09.

  4. The right-most leaf node predicts a quality score of 6.49 for wines with the highest alcohol content >= 11.7.

  5. So in total it has generated 5 leaf nodes predicting quality scores ranging from 5.30 to 6.49 based on 3 split points on alcohol content and 1 split point on volatile acidity.


Results

  1. Wines with higher alcohol content, increased citric acid levels, lower density, and less chlorides tend to exhibit higher quality.

  2. It appears that white wines, in general, tend to be sweeter and have higher alcohol content when compared to their red counterparts.

  3. Red Wines have more concentration of sulphates and chlorides.

  4. Logistic Regression was the best model to predict wine quality.

  5. From feature selection we found out that Volatile Acidity, Residual Sugar, Sulphur Concentration, Alcohol, and pH are the most important attributes while predicting wine quality.

  6. Higher Alcohol content, higher citric acid, less dense, and less chlorides make for better wine.

  7. White wines are sweeter than red wines because of higher sugar content.

  8. Initially we found out that citric acid and chlorides influence the quality but, after modelling, we found out that volatile acidity and sulphur concentration influence it more.


Limitations and Future scope

Although we thoroughly believe in our analysis, we have to mention a few anomalies that are present that may or may not have influenced the results.

  1. The number of data that we have on white wine is comparatively more than that of red wine.

  2. We have also observed that most of the data present are in the average quality range i.e, from 4-8.

  3. Classification algorithms are also available to group similar quality wines.

A MORE BALANCED SET OF DATA WILL IMPROVE OUR ANALYSIS.


Conclusions

Through our exploration of wine quality we have come across some discoveries that challenge wine making beliefs. While alcohol content, sugar levels and acidity are known factors we have found that attributes, like density and chlorides also play roles in shaping the quality of wine. These elements contribute in ways that captivate our senses and define what truly makes a bottle of wine.

Moreover, we have delved into the contrast between white wines going beyond mere color differences to uncover deeper variations in chemical composition and sensory profiles. By understanding these distinctions we empower wine enthusiasts and consumers to make choices based on their personal preferences and specific occasions. With the support of tests such as the T Test to check if there is a significant difference between the factors for red and white wine and ANOVA Tests to find how these factors vary across different quality ratings in red and white wine our findings are substantiated by evidence establishing a strong framework for comprehending the intricate interplay of variables that influence wine quality.

From our modelling techniques, we also found out which features actually help in predicting the quality of wine.

Equipped with this knowledge let us raise our glasses to the fusion of data science and wine making artistry. We also hope that one uses this knowledge to buy great wines.


Additional Insights

Even if our present investigation has been fascinating, we understand that the quest to predict wine quality is a dynamic endeavour. Our commitment to gaining deeper insights is demonstrated by our efforts to include a wider range of attributes that affect wine quality to our dataset. This growth is a calculated step towards a more comprehensive knowledge of the complex process of wine making, not just an increase in data. We want to build a predictive model that goes into the nuances of winemaking and captures every detail that contributes to wine greatness by incorporating a wider variety of variables.

We hope to improve our predictive model in this ongoing investigation to deliver even more thorough and accurate results.